2.2.3.1 Bayes Classification

In an ideal world, we would know, for each class \(c = 1,...,C\), the probability that \(x\) belongs to the \(c\)-th class. For example, if we are classifying fruits, \(K_1\) could be “Banana”, \(K_2\) could be “Apple”, and so on. When we write \(\mathbb{P}(K_c |x)\) we are asking: “Given that we observed \(x\), what is the probability that it belongs to class \(c\)?” If we denote the event of belonging to class \(c\) by \(K_c\) and we know the conditional probabilities \(\mathbb{P}(x |K_c)\) (the conditional distributions), e.g. \(\mathbb{P}(1003\text{g}, \text{red}\,|\,\text{Banana}) = 0\%\), then we can use Bayes’ theorem to obtain \[\mathbb{P}(K_c|x) = \frac{\mathbb{P}(K_c)\mathbb{P}(x|K_c)}{\mathbb{P}(x)} = \frac{\mathbb{P}(K_c)\mathbb{P}(x|K_c)}{\sum_{i=1}^C \mathbb{P}(K_i)\mathbb{P}(x|K_i)}\]
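To make the formula concrete, here is a minimal Python sketch of the posterior computation, assuming the priors \(\mathbb{P}(K_c)\) and the likelihoods \(\mathbb{P}(x|K_c)\) for one fixed observation \(x\) are already given as numbers (the values below are purely illustrative):

```python
import numpy as np

# Bayes' theorem for C classes:
# P(K_c | x) = P(K_c) * P(x | K_c) / sum_i P(K_i) * P(x | K_i).
# The priors and likelihoods below are illustrative placeholders.

priors = np.array([0.3, 0.5, 0.2])         # P(K_c), c = 1, ..., C
likelihoods = np.array([0.01, 0.04, 0.0])  # P(x | K_c) for one fixed observation x

joint = priors * likelihoods               # numerators P(K_c) * P(x | K_c)
posteriors = joint / joint.sum()           # divide by P(x) = sum of the numerators

print(posteriors)                          # posterior over the C classes, sums to 1
```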

Explanation

The probability \(\mathbb{P}(K_c)\) of the \(c\)-th class is usually abbreviated as \(\pi_c\) (the class prior), leading to

2.2.9

\[\mathbb{P}(K_c|x) = \frac{\pi_c\mathbb{P}(x|K_c)}{\sum_{i=1}^C \pi_i\mathbb{P}(x|K_i)}\] Classification is then done by choosing the most probable class, i.e. \[ \DeclareMathOperator*{\argmax}{arg\,max} \argmax_{c=1}^C \frac{\pi_c\mathbb{P}(x|K_c)}{\sum_{i=1}^C \pi_i\mathbb{P}(x|K_i)} = \argmax_{c=1}^C \hspace{0.1cm} \pi_c\mathbb{P}(x |K_c) \] (Since the denominator is the same for all classes, we can just compute the numerator!) This means: we pick the class whose prior-weighted likelihood \(\pi_c\mathbb{P}(x|K_c)\) is largest.

Example: Classifying fruits

Imagine we want to classify a fruit \(x\) based on weight and color. Suppose we have the two classes “Banana” and “Apple” together with their priors and conditional distributions.

If we observe \(x = (1003\text{g}, \text{red})\), we compute the probability of observing that feature vector for each fruit class and pick the class for which \(\pi_c\mathbb{P}(x|K_c)\) is largest; as noted above, \(\mathbb{P}(1003\text{g}, \text{red}\,|\,\text{Banana}) = 0\%\), so the class “Banana” is ruled out immediately.
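A minimal sketch of this decision in Python; only \(\mathbb{P}(1003\text{g}, \text{red}\,|\,\text{Banana}) = 0\%\) is taken from the text above, while the priors and the apple likelihood are hypothetical placeholders:

```python
# Bayes classification of x = (1003g, red) with two classes.
# Only P(x | Banana) = 0 comes from the text; the other numbers are made up.

priors = {"Banana": 0.5, "Apple": 0.5}        # pi_c
likelihoods = {"Banana": 0.0, "Apple": 0.02}  # P(1003g, red | K_c)

# The denominator is identical for all classes, so comparing the
# numerators pi_c * P(x | K_c) is enough.
scores = {c: priors[c] * likelihoods[c] for c in priors}
prediction = max(scores, key=scores.get)

print(scores)      # {'Banana': 0.0, 'Apple': 0.01}
print(prediction)  # 'Apple'
```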

Back to the script

The mapping \(x \mapsto \argmax_{c=1}^C \pi_c \mathbb{P}(x|K_c)\) is also called the Bayes classifier. This means that, given some feature \(x\), we assign it to the class \(K_c\) that maximizes \(\pi_c \mathbb{P}(x|K_c)\). Note that this is independent of the denominator in 2.2.9. Often, \(x\) is a continuous random variable, making it meaningless to work with \(\mathbb{P}(x |K_c)\): if \(x|K_c\) is, for example, normally distributed, then \(\mathbb{P}(x |K_c) = 0\) for all \(x \in \mathbb{R}^d\). In this case, however, we can use Bayes’ theorem for densities and obtain

2.2.10

\[\mathbb{P}(K_c|x) = \frac{\pi_c p(x|K_c)}{\sum_{i=1}^C \pi_i p(x|K_i)}\] where \(p(x|K_c)\) are the conditional densities (\(p(x|K_c)\) is now a PDF of \(x\) given class \(K_c\)). Note that the left-hand side still represents a probability and not a density, since the possible classes form a discrete and finite probability space! In practice, we unfortunately have neither \(\pi_c\) nor \(\mathbb{P}(x |K_c)\) or \(p(x|K_c)\) available. Both quantities must be approximated using the data. To this end, one can use the class proportions: \[\hat{\pi}_c := \frac{\#\{i = 1,...,N \hspace{0.1cm} | \hspace{0.1cm} y_i = l_c\}}{N}\] This just means: count how many of the \(N\) training samples carry the label \(l_c\) and divide by \(N\).
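A quick sketch of the prior estimate \(\hat{\pi}_c\) in Python, assuming the labels \(y_1,...,y_N\) are stored in a plain list (the toy labels are made up):

```python
from collections import Counter

# Estimate the class priors pi_c by the class proportions in the training labels.
# The labels below are an illustrative toy sample.

y = ["apple", "banana", "apple", "apple", "banana"]  # y_1, ..., y_N
N = len(y)

counts = Counter(y)                                   # #{i | y_i = l_c}
pi_hat = {label: counts[label] / N for label in counts}

print(pi_hat)  # {'apple': 0.6, 'banana': 0.4}
```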

Additionally, the conditional probabilities \(\mathbb{P}(x|K_c)\) or densities \(p(x|K_c)\) must be suitably approximated by \(\hat{P}(x|K_c)\) or \(\hat{p}(x|K_c)\):

  - \(\hat{P}(x|K_c)\) is an approximation of the probability \(\mathbb{P}(x|K_c)\) when \(x\) is discrete (e.g. categorical features)
  - \(\hat{p}(x|K_c)\) is an approximation of the probability density function \(p(x|K_c)\) when \(x\) is continuous (e.g. real-valued features)

We then obtain the approximate Bayes classifier: \[x \mapsto \argmax_{c=1}^C \hspace{0.1cm}\hat{\pi}_c\hat{P}(x|K_c) \hspace{0.5cm} \text{or} \hspace{0.5cm} x \mapsto \argmax_{c=1}^C\hspace{0.1cm} \hat{\pi}_c \hat{p}(x|K_c)\] This means:

  1. Compute the estimated prior \(\hat{\pi}_c\) for every class
  2. Compute the estimated likelihood \(\hat{P}(x|K_c)\) or \(\hat{p}(x|K_c)\) (depending on whether \(x\) is discrete or continuous)
  3. Assign \(x\) to the class \(c\) that maximizes the product \(\hat{\pi}_c\hat{P}(x|K_c)\) (respectively \(\hat{\pi}_c\hat{p}(x|K_c)\)), as sketched below
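Putting the three steps together, a minimal sketch of the approximate Bayes classifier could look as follows; the estimates `pi_hat` and `p_hat` are assumed to be computed beforehand (e.g. as in the discrete and Gaussian cases below), and the toy likelihood functions are purely illustrative:

```python
# Approximate Bayes classifier: assign x to the class maximizing pi_hat_c * p_hat(x | K_c).
# pi_hat and p_hat are assumed to be estimated from the training data beforehand.

def bayes_classify(x, pi_hat, p_hat):
    """pi_hat: dict class -> estimated prior,
       p_hat:  dict class -> function estimating P(x | K_c) or p(x | K_c)."""
    scores = {c: pi_hat[c] * p_hat[c](x) for c in pi_hat}
    return max(scores, key=scores.get)

# Toy usage with hand-made likelihood functions (purely illustrative):
pi_hat = {"apple": 0.6, "banana": 0.4}
p_hat = {"apple":  lambda x: 0.75 if x == "red" else 0.25,
         "banana": lambda x: 0.05 if x == "red" else 0.95}

print(bayes_classify("red", pi_hat, p_hat))  # 'apple'
```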

Discrete Random Variables

If \(x |K_c\) is a discrete random variable (e.g. colors red, green, yellow), one can approximate \(\mathbb{P}(x|K_c)\) by proportions in the training data, e.g.

\[\hat{P}(\text{red}|K_c) := \frac{\#\{i |x_i = \text{red}, y_i = l_c\}}{\# \{i | y_i = l_c\}}\] This means: among all training samples with label \(l_c\), count the fraction whose feature value is “red”.

Example

There are 3 red apples and 4 apples in total, so \[\hat{P}(\text{red}|K_{\text{apple}}) = \frac{3}{4} = 0.75\] This means that 75% of the apples in our training data are red.
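In code, this counting estimate could be computed roughly as follows; the toy data is chosen so that 3 of the 4 apples are red, matching the example:

```python
# Estimate P(red | apple) by counting in the training data.
# Toy data chosen to match the example: 3 red apples out of 4 apples.

x = ["red", "red", "green", "red", "yellow", "yellow"]        # colors x_i
y = ["apple", "apple", "apple", "apple", "banana", "banana"]  # labels y_i

def p_hat_discrete(value, label, x, y):
    in_class = [xi for xi, yi in zip(x, y) if yi == label]    # {i | y_i = l_c}
    return sum(xi == value for xi in in_class) / len(in_class)

print(p_hat_discrete("red", "apple", x, y))  # 0.75
```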

Gaussian Mixture Model

For continuous variables, a common assumption is that \(x|K_c \sim \mathcal{N}(\mu_c, \sigma_c^2)\) is normally distributed with mean \(\mu_c\) and variance \(\sigma_c^2\) for \(c=1,...,C\). For simplicity, we will only consider the one-dimensional case here. We can approximate the densities \(p(x|K_c)\) for \(x \in \mathbb{R}\) by \[\hat{p}(x|K_c) := \frac{1}{\sqrt{2\pi \hat{\sigma}_c^2}} \exp\biggl(-\frac{(x-\hat{\mu}_c)^2}{2\hat{\sigma}_c^2}\biggr)\] for \(c = 1,...,C\), where \[\hat{\mu}_c := \frac{1}{\#\{i | y_i = l_c\}}\sum_{i |y_i = l_c}x_i \hspace{1cm} \hat{\sigma}_c^2 := \frac{1}{\#\{i | y_i = l_c\}} \sum_{i|y_i=l_c}(x_i-\hat{\mu}_c)^2 \] are the empirical mean and variance of the points with label \(l_c\):

  - \(\hat{\mu}_c\) is just the average of all feature values \(x_i\) that belong to class \(c\): the numerator sums all \(x_i\) values for class \(c\), and the denominator is the number of training samples in class \(c\)
  - \(\hat{\sigma}_c^2\) is the variance estimate, which measures the spread of the values around that mean: each sample's squared difference from the mean is summed and then divided by the number of training points in class \(c\)
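A sketch of the class-wise Gaussian estimate in the one-dimensional case, assuming the features and labels are stored in NumPy arrays (the weights are made-up toy values):

```python
import numpy as np

# Class-wise Gaussian estimate of p(x | K_c) in 1D:
# mu_hat_c  = mean of the class's feature values,
# sigma_hat_c^2 = their variance (divided by the class count, as in the text).

x = np.array([150.0, 160.0, 140.0, 155.0, 118.0, 126.0])        # toy weights in grams
y = np.array(["apple", "apple", "apple", "apple", "banana", "banana"])

def fit_gaussian(x, y, label):
    xc = x[y == label]                     # feature values x_i with y_i = l_c
    mu_hat = xc.mean()                     # estimated class mean
    var_hat = ((xc - mu_hat) ** 2).mean()  # estimated class variance
    return mu_hat, var_hat

def p_hat_gauss(value, mu_hat, var_hat):
    # Gaussian density with the estimated parameters
    return np.exp(-(value - mu_hat) ** 2 / (2 * var_hat)) / np.sqrt(2 * np.pi * var_hat)

mu_a, var_a = fit_gaussian(x, y, "apple")
print(p_hat_gauss(152.0, mu_a, var_a))     # estimated density p_hat(152g | apple)
```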

Kernel Density Classification

If the distribution of \(x |K_c\) is more complicated, the Gaussian Mixture Model may not be a good choice. In this case, one can approximate it using a so-called kernel density estimator. We also restrict ourselves here to the case \(d=1\), i.e. the inputs are one-dimensional. We define

2.2.12

\[\hat{p}(x|K_c) := \frac{1}{\#\{i|y_i = l_c\}} \sum_{i|y_i=l_c} K_\lambda (x-x_i)\] where \(\lambda >0\) is the bandwidth, \(K_\lambda(x) := \frac{1}{\lambda}K(\frac{x}{\lambda})\), and \(K : \mathbb{R} \to \mathbb{R}\) is a non-negative function with \(\int_\mathbb{R} K(x)\,dx = 1\). Possible choices are the Gaussian kernel \[K(x) := (2\pi)^{-1/2}\exp(-x^2/2)\] or the box kernel \(K(x) := \frac{1}{2}1_{|x| \leq 1}\). Note, however, that for the second choice it may happen that \(\hat{p}(x|K_c) = 0\) for all \(c= 1,...,C\), which does not allow for a class assignment. This approach depends on the choice of the bandwidth \(\lambda\). In particular, for \(0 < \lambda \ll 1\), \(\hat{p}(x|K_c) \approx 0\) if \(x\) is not a training data point \(x_i\). On the other hand, if \(\lambda\) is very large, \(\hat{p}(\cdot |K_c)\) is nearly constant. It can be shown that, for a large number \(n \in \mathbb{N}\) of training points, an optimal choice of \(\lambda\) takes the form \(\lambda = Cn^{-\frac{1}{5}}\). For normally distributed data \(x_i \sim \mathcal{N}(\mu, \sigma^2)\), the optimal constant is given by \(C = (\frac{4}{3})^{\frac{1}{5}}\sigma\).
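A sketch of the kernel density estimate 2.2.12 with the Gaussian kernel, using the bandwidth rule of thumb \(\lambda = (\frac{4}{3})^{\frac{1}{5}}\hat{\sigma} n^{-\frac{1}{5}}\) from above; the training points of the single class shown here are synthetic:

```python
import numpy as np

# Kernel density estimate (2.2.12) with the Gaussian kernel K_lambda(x) = K(x/lambda)/lambda.
# Toy 1D training data for a single class; lambda via the rule of thumb from the text.

rng = np.random.default_rng(0)
xc = rng.normal(loc=150.0, scale=10.0, size=200)     # training points x_i of one class

n = len(xc)
lam = (4 / 3) ** (1 / 5) * xc.std() * n ** (-1 / 5)  # lambda = (4/3)^(1/5) * sigma_hat * n^(-1/5)

def gaussian_kernel(u):
    return np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def p_hat_kde(value):
    # average of K_lambda(value - x_i) over the class's training points
    return gaussian_kernel((value - xc) / lam).mean() / lam

print(lam, p_hat_kde(150.0), p_hat_kde(300.0))  # density is high near the data, ~0 far away
```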


Next up: 2.2.3.2 Naive Bayes Classification